1  Introduction

The amount of data of all kinds available electronically has increased dramatically
in recent years. The data resides in different forms, ranging from unstructured
data in file systems to highly structured data in relational database systems.
Data is accessible through a variety of interfaces including Web browsers,
database query languages, application-specific interfaces, or data exchange
formats. Some of this data is raw data, e.g., images or sound. Some of it has
structure, even if the structure is often implicit and not as rigid or regular as
that found in standard database systems. Sometimes the structure exists but has
to be extracted from the data. Sometimes it exists but we prefer to ignore it
for certain purposes such as browsing. We call semi-structured data the data
that is (from a particular viewpoint) neither raw data nor strictly typed, i.e.,
not table-oriented as in a relational model or sorted-graph as in object databases.

As will be seen later when the notion of semi-structured data is more precisely
defined, the need for semi-structured data arises naturally in the context of data
integration, even when the data sources are themselves well-structured. Although
data integration is an old topic, the need to integrate a wider variety of data
formats (e.g., SGML or ASN.1 data) and data found on the Web has brought
the topic of semi-structured data to the forefront of research.

The  main  purpose  of  the  paper  is  to  isolate  the  essential  aspects  of  semi- 
structured  data.  We  also  survey  some  proposals  of  models  and  query  languages 
for  semi-structured  data.  In  particular,  we  consider  recent  works  at  Stanford  U. 
and  U.  Penn  on  semi-structured  data.  In  both  cases,  the  motivation  is  found  in 
the  integration  of  heterogeneous  data.  The  “lightweight”  data  models  they  use 
(based  on  labelled  graphs)  are  very  similar. 

As  we  shall  see,  the  topic  of  semi-structured  data  has  no  precise  boundary. 
Furthermore,  a  theory  of  semi-structured  data  is  still  missing.  We  will  try  to 
highlight  some  important  issues  in  this  context. 

The  paper  is  organized  as  follows.  In  Section  2,  we  discuss  the  particularities 
of  semi-structured  data.  In  Section  3,  we  consider  the  issue  of  the  data  structure 
and  in  Section  4,  the  issue  of  the  query  language. 

*  Currently visiting the Computer Science Dept., Stanford U. Work supported in part
by CESDIS, NASA Goddard Space Flight Center; by the Air Force Wright Laboratory
Aeronautical Systems Center under ARPA Contract F33615-93-1-1339, and by
the Air Force Rome Laboratories under ARPA Contract F30602-95-C-0119.
2  Semi-Structured Data


In  this  section,  we  make  more  precise  what  we  mean  by  semi-structured  data, 
how  such  data  arises,  and  emphasize  its  main  aspects. 

Roughly speaking, semi-structured data is data that is neither raw data nor
very strictly typed as in conventional database systems. Clearly, this definition
is imprecise. For instance, would a BibTex file be considered structured or
semi-structured? Indeed, the same piece of information may be viewed as unstructured
at some early processing stage, but later become very structured after some
analysis has been performed. In this section, we give examples of semi-structured
data, make this notion more precise, and describe important issues in this context.


2.1  Examples 

We will often discuss BibTex files [Lam94] in this paper; they have the advantage
of being more familiar to researchers than other well-accepted formats
such as SGML [ISO86] or ASN.1 [ISO87]. Data in BibTex files closely resembles
relational data. Such a file is composed of records, but the structure is not as
regular. Some fields may be missing. (Indeed, it is common to find even compulsory
fields missing.) Other fields have some meaningful structure, e.g., author.
There are complex features such as abbreviations or cross references that are not
easy to describe in some database systems.

The Web also provides numerous popular examples of semi-structured data.
On the Web, data consists of files in a particular format, HTML, with some
structuring primitives such as tags and anchors. A typical example is a data source
about restaurants in the Bay Area (from the Palo Alto Weekly newspaper), that
we will call Guide. It consists of an HTML file with one entry per restaurant
and provides some information on prices, addresses, styles of restaurants and
reviews. Data in Guide resides in files of text with some implicit structure. One
can write a parser to extract the underlying structure. However, there is a large
degree of irregularity in the structure since (i) restaurants are not all treated in
a uniform manner (e.g., much less information is given for fast-food joints) and
(ii) information is entered as plain text by human beings, who do not show the
rigidity of a typical data loader. Therefore, the parser will have to
be tolerant: it must accept that it will fail to parse some portions of the text,
which will remain as plain text.
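
To illustrate what such a tolerant parser might look like, the following Python
sketch extracts a price and a zipcode from a restaurant entry when it can
recognize them, and always keeps the entry itself as plain text. The entry
layout, the regular expressions, the function name parse_entry and the sample
entries are all assumptions made for illustration; the actual format of Guide
is not specified here.

```python
import re

# Hypothetical patterns: the real Guide layout is not described in the text.
PRICE = re.compile(r"\$\s*(\d+(?:\.\d{2})?)")
ZIPCODE = re.compile(r"\b(\d{5})\b")


def parse_entry(text):
    """Extract what can be recognized; the raw text is always retained."""
    record = {"text": text}  # kept whether or not anything else is parsed
    price = PRICE.search(text)
    if price:
        record["price"] = float(price.group(1))
    zipcode = ZIPCODE.search(text)
    if zipcode:
        record["zipcode"] = zipcode.group(1)
    return record


# Made-up sample entries, one detailed and one sparse.
full = parse_entry("Alpha Cafe, 12 Main St, Palo Alto 94301. Entrees around $18.")
sparse = parse_entry("Burger stand near the station.")
```

The sparse entry yields a record with only the text field, which is the
behaviour the paragraph above asks for: failure to parse is not an error.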

Also, semi-structured data often arises when integrating several (possibly
structured) sources. Data integration of independent sources has been a popular
topic of research since the very early days of databases. (Surveys can be found in
[SL90, LMR90, Bre90], and more recent work on the integration of heterogeneous
sources in, e.g., [LRO96, QRS+95, C+95].) It has gained new vigor with the
recent popularity of the Web. Consider the integration of car retailer databases.
Some retailers will represent addresses as strings and others as tuples. Retailers
will probably use different conventions for representing dates, prices, invoices,
etc. We should expect some information to be missing from some sources (e.g.,
some retailers may not record whether non-automatic transmission is available).
More generally, a wide heterogeneity in the organization of data should be
expected from the car retailer data sources, and not all of it can be resolved by the
integration software.

Semi-structured data arises in a variety of forms for a wide range of applications
such as genome databases, scientific databases, libraries of programs and,
more generally, digital libraries, on-line documentation, and electronic commerce. It
is thus essential to better understand the issue of querying semi-structured data.


2.2  Main  aspects 

The  structure  is  irregular: 

This should be clear from the previous discussion. In many of these applications,
the  large  collections  that  are  maintained  often  consist  of  heterogeneous  elements. 
Some  elements  may  be  incomplete.  On  the  other  hand,  other  elements  may  record 
extra  information,  e.g.,  annotations.  Different  types  may  be  used  for  the  same 
kind  of  information,  e.g.,  prices  may  be  in  dollars  in  portions  of  the  database 
and  in  francs  in  others.  The  same  piece  of  information,  e.g.,  an  address,  may  be 
structured  in  some  places  as  a  string  and  in  others  as  a  tuple. 

Modelling  and  querying  such  irregular  structures  are  essential  issues. 

The  structure  is  implicit: 

In  many  applications,  although  a  precise  structuring  exists,  it  is  given  implicitly. 
For instance, electronic documents often consist of a text and a grammar (e.g., a
DTD  in  SGML).  The  parsing  of  the  document  then  allows  one  to  isolate  pieces  of 
information  and  detect  relationships  between  them.  However,  the  interpretation 
of  these  relationships  (e.g.,  SGML  exceptions)  may  be  beyond  the  capabilities  of 
standard database models and is left to the particular applications and specific
tools.  We  view  this  structure  as  implicit  (although  specified  explicitly  by  tags) 
since  (i)  some  computation  is  required  to  obtain  it  (e.g.,  parsing)  and  (ii)  the 
correspondence  between  the  parse-tree  and  the  logical  representation  of  the  data 
is  not  always  immediate. 

It is also sometimes the case, in particular for the Web, that the documents
come as plain text. Some ad-hoc analysis is then needed to extract the structure.
For instance, in the Guide data source, the description of a restaurant is in plain
text. Clearly, it is possible to develop analysis tools to recognize prices,
addresses, etc. and then extract the structure of the file. Extracting
the structure of some text (e.g., HTML) is a challenging issue.

The  structure  is  partial: 

To  completely  structure  the  data  often  remains  an  elusive  goal.  Parts  of  the  data 
may  lack  structure  (e.g.,  bitmaps);  other  parts  may  only  unveil  some  very  sketchy 
structure  (e.g.,  unstructured  text).  Information  retrieval  tools  may  provide  a 
limited form of structure, e.g., by computing occurrences of particular words or
groups of words and by classifying documents based on their content.

An  application  may  also  decide  to  leave  large  quantities  of  data  outside  the 
database.  This  data  then  remains  unstructured  from  a  database  viewpoint.  The 
loading of this external data, its analysis, and its integration into the database have
to be performed efficiently. We may also want to use optimization techniques to
load only selected portions of this data, in the style of [ACM93]. In general, the
management  and  access  of  this  external  data  and  its  interoperability  with  the 
data  from  the  database  is  an  important  issue. 

Indicative  structure  vs.  constraining  structure: 

In  standard  database  applications,  a  strict  typing  policy  is  enforced  to  protect 
data. We are concerned here with applications where such a strict policy is often
viewed  as  too  constraining.  Consider  for  instance  the  Web.  A  person  developing 
a  personal  Web  site  would  be  reluctant  to  accept  strict  typing  restrictions. 

In the context of the Lore Project at Stanford, the term data guide was
adopted to emphasize non-conventional approaches to typing found in most
semi-structured data applications. A schema (as in conventional databases) describes
a strict type that is adhered to by all data managed by the system. An update
that does not conform is simply rejected. On the other hand, a data guide provides some
information about the current type of the data. It does not have to be the most
accurate. (Accuracy may be traded for simplicity.) All new data is accepted,
possibly at the cost of modifying the data guide.

A-priori  schema  vs.  a-posteriori  data  guide: 

Traditional  database  systems  are  based  on  the  hypothesis  of  a  fixed  schema  that 
has  to  be  defined  prior  to  introducing  any  data.  This  is  not  the  case  for  semi- 
structured  data  where  the  notion  of  schema  is  often  posterior  to  the  existence 
of  data. 

Continuing  with  the  Web  example,  when  all  the  members  of  an  organization 
have  a  Web  page,  there  is  usually  some  pressure  to  unify  the  style  of  these 
home-pages,  or  at  least  agree  on  some  minimal  structure  to  facilitate  the  design 
of  global  entry-points.  Indeed,  it  is  a  general  pattern  for  large  Web  sources  to 
start  with  a  very  loose  structure  and  then  acquire  some  structure  when  the  need 
for  it  is  felt. 

Further  on,  we  will  briefly  mention  issues  concerning  data  guides. 

The  schema  is  very  large: 

Often  as  a  consequence  of  heterogeneity,  the  schema  would  typically  be  quite 
large. This is in contrast with relational databases where the schema was
expected to be orders of magnitude smaller than the data. For instance, suppose
that  we  are  interested  in  Californian  Impressionist  Painters.  We  may  find  some 
data  about  these  painters  in  many  heterogeneous  information  sources  on  the 
Web,  so  the  schema  is  probably  quite  large.  But  the  data  itself  is  not  so  large. 

Note that as a consequence, the user is not expected to know all the details of
the schema. Thus, queries over the schema are as important as standard queries
over the data. Indeed, one can no longer separate these two aspects of queries.

The schema is ignored:

Typically,  it  is  useful  to  ignore  the  schema  for  some  queries  that  have  more  of  a 
discovery nature. Such queries may consist of simply browsing through the data
or searching for some string or pattern without any precise indication of where it
may occur. Such searching or browsing is typically not possible with SQL-like
languages.  They  pose  new  challenges:  (i)  the  extension  of  the  query  languages; 
and  (ii)  the  integration  of  new  optimization  techniques  such  as  full-text  indexing 
[ACC+96]  or  evaluation  of  generalized  path  expressions  [CCM96]. 
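
As a small illustration of a query that ignores the schema, the Python sketch
below searches for a string anywhere in a nested structure and reports the
label paths at which it was found. The nested-dictionary representation, the
function name find_paths and the sample data are assumptions made for this
example; the sketch does a naive full traversal and does not implement the
indexing or path-expression techniques cited above.

```python
def find_paths(data, needle, path=()):
    """Yield the label path to every atomic value that contains `needle`."""
    if isinstance(data, dict):
        for label, child in data.items():
            yield from find_paths(child, needle, path + (label,))
    elif isinstance(data, list):
        # A list holds several values reached through the same label.
        for child in data:
            yield from find_paths(child, needle, path)
    elif needle in str(data):
        yield path


# Made-up sample data with irregular structure.
guide = {
    "restaurant": [
        {"name": "Alpha", "address": {"street": "12 Main St", "zipcode": 94301}},
        {"name": "Beta", "address": "7 Main St, Palo Alto", "zipcode": "94301"},
    ]
}

zip_paths = list(find_paths(guide, "94301"))
main_paths = list(find_paths(guide, "Main"))
```

The user does not say where the string should occur; the answer itself reveals
that the zipcode appears under two different paths.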

The  schema  is  rapidly  evolving: 

In  standard  database  systems,  the  schema  is  viewed  as  almost  immutable,  schema 
updates  as  rare,  and  it  is  well-accepted  that  schema  updates  are  very  expensive. 

Now, in contrast, consider the case of genome data [DOB95]. The schema is
expected to change quite rapidly, at the same speed as experimental techniques
are improved or novel techniques introduced. As a consequence, expressive
formats such as ASN.1 or ACeDB [TMD92] were preferred to a relational or object
database system approach. Indeed, the fact that the schema evolves very rapidly is
often given as the reason for not using database systems in applications that are
managing large quantities of data. (Other reasons include the cost of database
systems and the interoperability with other systems, e.g., Fortran libraries.)

In the context of semi-structured data, we have to assume that the schema is
very flexible and can be updated as easily as data, which poses serious challenges
to database technology.

The  type  of  data  elements  is  eclectic: 


Another aspect of semi-structured data is that the structure of a data element
may depend on a point of view or on a particular phase in the data acquisition
process. So, the type of a piece of information has to be more eclectic than, say, in
standard database systems where the structure of a record or that of an object
is very precise. For instance, an object can be first a file. It may become a
BibTex file after classification using a tool in the style of [TPL95]. It may then
obtain owner, creation-date, and other fields after some information extraction
phase. Finally, it could become a collection of reference objects (with complex
structures) once it has been parsed. In that respect also, the notion of type is
much more flexible.

This is an issue of objects with multiple roles, e.g., [ABGO93] and objects in
views, e.g., [dSAD94].

The  distinction  between  schema  and  data  is  blurred: 

In  standard  database  applications,  a  basic  principle  is  the  distinction  between 
the  schema  (that  describes  the  structure  of  the  database)  and  data  (the  database 
instance). We already saw that many differences between schema and data disappear
in the context of semi-structured data: schema updates are frequent, schema
laws can be violated, the schema may be very large, the same queries/updates
may address both the data and schema. Furthermore, in the context of semi-structured
data, this distinction may even logically make little sense. For instance,
the same classification information, e.g., the sex of a person, may be
kept  as  data  in  one  source  (a  boolean  with  true  for  male  and  false  for  female) 
and  as  type  in  the  other  (the  object  is  of  class  Male  or  Female).  We  are  touching 
here  issues  that  dramatically  complicate  database  design  and  data  restructuring. 


2.3  Some  issues 

To  conclude  this  section,  we  consider  a  little  more  precisely  important  issues  in 
the  context  of  semi-structured  data. 

Model  and  languages  for  semi-structured  data: 

Which model should be used to describe semi-structured data and to manipulate
this data? By languages, we mean here languages to query semi-structured
data but also languages to restructure such data, since restructuring is essential,
for instance, to integrate data coming from several sources. There are two
main difficulties: (i) we have only a partial knowledge of the structure; and (ii)
the structure is potentially “deeply nested” or even cyclic. This second point
in particular defeats calculi and algebras developed in the standard database
context (e.g., relational, complex value algebra) by requiring recursion. It seems
that languages such as Datalog (see [Ull89, AHV94]), although they provide some
form of recursion, are not completely satisfactory.

These issues will be dealt with in more detail in the next two sections.

Extracting and using structure:

The general idea is, starting with data with little explicit structure, to extract
structuring information and organize the data to improve performance. To continue
with the bibliography example, suppose we have a number of files containing
bibliography references in BibTex and other formats. We may want to
extract (in a data warehousing style) the titles of the papers, lists of authors
and keywords, i.e., the most frequently accessed data that can be found in every
format for references, and store them in a relational database. Note that this
extraction phase may be difficult if some files are structured according to formats
unknown to our system. Also, issues such as duplicate elimination have to
be faced. In general, recognizing an object in a particular state or
within a sequence of states (for temporal data) is a challenging issue.

The relational database then contains links to pieces of information in the
files, so that all data remains accessible. Such a structured layer on top of an irregular
and less controlled layer of files can provide important gains in answering
the most common queries.

In general, we need tools to extract information from files, including classifiers
and parsers, but also software to extract cross references (e.g., within a set of
HTML documents), and information retrieval packages to obtain statistics on occurrences
of words (or groups of words) and statistics for relevance ranking and relevance
feedback. More generally, one could envision the use of general purpose
data mining tools to extract structuring information.

One can then use the information extracted from the files to build a structured
layer above the layer of more unformed data. This structured layer references
the lower data layer and yields flexible and efficient access to the information
in the lower layer, providing the benefits of standard database access
methods. A similar concept is called structured map in [DMRA96].
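
The following Python sketch illustrates such a structured layer under strong
simplifying assumptions: the "files" are in-memory strings, entries occupy one
line each, and only the title is extracted. The table name ref, the helper
raw_entry and the sample entries are hypothetical. Each relational row keeps a
link (file name, start offset, length) back to the raw entry, so that all data
remains accessible.

```python
import re
import sqlite3

# Stand-ins for files on disk; the contents are made-up sample entries.
files = {
    "refs.bib": '@article{a1, title = "Graph Data", year = 1996}\n'
                '@misc{m1, note = "no title here"}\n',
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ref (title TEXT, file TEXT, start INTEGER, length INTEGER)")

ENTRY = re.compile(r"@\w+\{[^\n]*\}")      # assumes one entry per line
TITLE = re.compile(r'title\s*=\s*"([^"]*)"')

# Extraction phase: store the frequently accessed field plus a link to the source.
for name, content in files.items():
    for entry in ENTRY.finditer(content):
        title = TITLE.search(entry.group(0))
        conn.execute(
            "INSERT INTO ref VALUES (?, ?, ?, ?)",
            (title.group(1) if title else None, name, entry.start(), len(entry.group(0))),
        )


def raw_entry(title):
    """Follow the link from the structured layer back to the raw text."""
    row = conn.execute(
        "SELECT file, start, length FROM ref WHERE title = ?", (title,)
    ).fetchone()
    if row is None:
        return None
    file, start, length = row
    return files[file][start:start + length]
```

Common queries (by title) are answered from the relational table, while fields
that were not extracted, such as the year, are still reachable through the link.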

More  ways  to  use  structure:  the  data  guide 

We saw that many differences with standard databases come from a very different
approach to typing. We used the term data guide to stress the differences.
A similar notion is considered in [BDFS97]. Now, since there is no schema to
view as a constraint on the data, one may question the need for any kind of
typing information, and for a data guide in particular. A data guide provides a
computed loose description of the structure of data. For instance, in a particular
application, the data guide may say that persons possibly have outgoing edges
labelled name, address, hobby and friend, and that an address is either a string
or has outgoing edges labelled street and zipcode. This should be
viewed as more or less accurate indications on the kind of data that is in the
database at the moment.
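
To make the idea of a computed, loose description concrete, here is a minimal
Python sketch that derives such a summary from a small labelled graph. It is a
deliberate simplification and not the data guide construction of the Lore
project: objects are grouped only by the label of the edge that leads to them,
and the graph encoding, the marker "<atomic>" and the sample data are
assumptions made for this example.

```python
from collections import defaultdict

# oid -> atomic value, or a list of (label, oid) pairs for complex objects.
db = {
    "&1": [("person", "&2"), ("person", "&3")],
    "&2": [("name", "&4"), ("address", "&5"), ("hobby", "&6")],
    "&3": [("name", "&7"), ("address", "&8"), ("friend", "&2")],
    "&4": "Ann", "&5": "12 Main St", "&6": "chess", "&7": "Bob",
    "&8": [("street", "&9"), ("zipcode", "&10")],
    "&9": "Elm St", "&10": 94301,
}


def data_guide(db):
    """For each edge label, collect what its target objects may look like."""
    guide = defaultdict(set)
    for value in db.values():
        if not isinstance(value, list):
            continue  # atomic objects have no outgoing edges
        for label, child in value:
            target = db[child]
            if isinstance(target, list):
                guide[label].update(lab for lab, _ in target)
            else:
                guide[label].add("<atomic>")
    return dict(guide)


guide = data_guide(db)
```

With this data, the entry for person lists name, address, hobby and friend,
and the entry for address records both the atomic case and the labels street
and zipcode, mirroring the example in the text.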

It  turns  out  that  there  are  many  reasons  for  using  a  data  guide: 

1.  graphical query language: Graphical interfaces use the schema in very essential
ways. For instance, QBE [Zlo77] would present a query frame that
consists of the names of relations and their attributes. In the context of semi-structured
data, one can view the data guide as an “encompassing type” that
would serve the role of a type in helping the user graphically express queries
or browse through the data.

2.  cooperative  answer:  Consider  for  instance  the  mistyping  of  a  label.  This  will 
probably  result  in  a  type  error  in  a  traditional  database  system,  but  not  here 
since  strict  type  enforcement  is  abandoned.  Using  a  data  guide,  the  system 
may still explain why the answer is empty (because such a label is absent from
the database).

3.  query  optimization:  Typing  information  is  very  useful  for  query  optimization. 
Even  when  the  structure  is  not  rigid,  some  knowledge  about  the  type  (e.g., 
presence/absence  of  some  attributes)  can  prove  to  be  essential.  For  instance, 
if  the  query  asks  for  the  Latex  sources  of  some  documents  and  the  data 
guides  indicate  that  some  sources  do  not  provide  Latex  sources,  then  a  call 
to  these  sources  can  be  avoided.  This  is  also  a  place  where  the  system  has  to 
show  some  flexibility.  One  of  the  sources  may  be  a  very  structured  database 
(e.g.,  relational),  and  the  system  should  take  advantage  of  that  structure. 
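
Two of the uses listed above, cooperative answers and query optimization, can
be sketched in a few lines of Python. The sketch assumes that a data guide is
available simply as a set of labels per source; the function names, source
names and labels are hypothetical, and the spelling suggestion relies on the
standard library function difflib.get_close_matches.

```python
import difflib


def explain_empty(label, guide_labels):
    """Cooperative answer: say why a query on `label` returned nothing."""
    if label in guide_labels:
        return f"label '{label}' exists; no object satisfied the query"
    close = difflib.get_close_matches(label, sorted(guide_labels), n=1)
    hint = f"; did you mean '{close[0]}'?" if close else ""
    return f"label '{label}' does not occur in the database{hint}"


def sources_worth_calling(label, guides):
    """Optimization: skip sources whose data guide lacks the label."""
    return [name for name, labels in guides.items() if label in labels]


labels = {"name", "address", "hobby", "friend"}
message = explain_empty("adress", labels)

# One assumed data guide (here, a set of labels) per source.
guides = {
    "library_a": {"title", "latex_source"},
    "library_b": {"title", "postscript"},
}
to_call = sources_worth_calling("latex_source", guides)
```

In both cases the data guide is consulted before any data is touched, which is
what makes it useful even though it constrains nothing.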


The notion of the data guide associated with some particular data with various
degrees of accuracy, its use for expressing and evaluating queries, and its
maintenance, are important directions of research.

System  issues: 

Although  this  is  not  the  main  focus  of  the  paper,  we  would  like  to  briefly  list 
some  system  issues.  We  already  mentioned  the  need  for  new  query  optimization 
techniques,  and  for  the  integration  of  optimization  techniques  from  various  fields 
(e.g.,  database  indexes  and  full  text  indexes).  Some  standard  database  system 
issues  such  as  transaction  management,  concurrency  control  or  error  recovery 
have  to  be  reconsidered,  in  particular,  because  the  notion  of  “data  item”  becomes 
less  clear:  the  same  piece  of  data  may  have  several  representations  in  various 
parts  of  the  system,  some  atomic,  some  complex.  Physical  design  (in  particular 
clustering)  is  seriously  altered  in  this  context.  Finally,  it  should  be  observed  that, 
by  nature,  a  lot  of  the  data  will  reside  outside  the  database.  The  optimization 
of  external  data  access  (in  particular,  the  efficient  and  selective  loading  of  file 
data)  and  the  interoperability  with  other  systems  are  therefore  key  issues. 

3  Modeling  Semi-Structured  Data 

A  first  fundamental  issue  is  the  choice  of  a  model:  should  it  be  very  rich  and 
complex,  or  on  the  contrary,  simple  and  lightweight?  We  will  argue  here  that  it 
should  be  both. 

Why  a  lightweight  model?  Consider  accessing  data  over  the  Internet.  If  we 
obtain  new  data  using  the  Web  protocol,  the  data  will  be  rather  unstructured 
at  first.  (Some  protocols  such  as  CORBA  [OMG92]  may  provide  a-priori  more 
structured  data.)  Furthermore,  if  the  data  originates  from  a  new  source  that 
we  just  discovered,  it  is  very  likely  that  it  is  structured  in  ways  that  are  still 
unknown  to  our  particular  systems.  This  is  because  (i)  the  number  of  semantic 
constructs  developers  and  researchers  may  possibly  invent  is  extremely  large  and 
(ii)  the  standardization  of  a  complex  data  model  that  will  encompass  the  needs 
of  all  applications  seems  beyond  reach. 

For  such  novel  structures  discovered  over  the  network,  a  lightweight  data 
model is preferable. Any data can be mapped to this exchange model, and therefore
becomes accessible without the use of specific pieces of software.

Why  also  a  heavyweight  data  model?  Using  a  lightweight  model  does  not 
preclude  the  use  of  a  compatible,  richer  model  that  allows  the  system  to  take 
advantage of particular structuring information. For instance, traditional relations
with indexes will often be imported. When using such an indexed relation,
ignoring  the  fact  that  this  particular  data  is  a  relation  and  that  it  is  indexed 
would  be  suicidal  for  performance. 

As we mentioned in the previous section, the types of objects evolve based on
our current knowledge, possibly from totally unstructured to very structured, and
a piece of information will often move from a very rich structure (in the system
where it is maintained); to a lightweight structure when exchanged over the
network; to a (possibly different) very rich structure when it has been analyzed
and integrated with other pieces of information. It is thus important to have
a flexible model allowing both a very light and a very rich structuring of data.

In this section, we first briefly consider some components of a rich model for
semi-structured data. This should be viewed as an indicative, non-exhaustive
list of candidate features. In our opinion, specific models for specific application
domains (e.g., Web databases or genome databases) are probably more feasible
than an all-purpose model for semi-structured data. Then, we present in more
detail the Object Exchange Model, which pursues a minimalist approach.


3.1  A  maximalist  approach 

We  next  describe  primitives  that  seem  to  be  required  from  a  semantic  model  to 
allow  the  description  of  semi-structured  data.  Our  presentation  is  rather  sketchy 
and  assumes  knowledge  of  the  ODMG  model.  The  following  primitives  should 
be  considered: 

1.  The  ODMG  model:  the  notions  of  objects,  classes  and  class  hierarchy;  and 
structuring  constructs  such  as  set,  list,  bag,  array  seem  all  needed  in  our 
context. 

2.  Null  values:  these  are  given  lip  service  in  the  relational  and  the  ODMG 
models  and  more  is  needed  here. 

3.  Heterogeneous collections: collections often need to be heterogeneous in the
semi-structured setting. So, there is the need for some union types as found
for instance in [AH87] or [AK89].

4.  Text  with  references:  text  is  an  important  component  for  semi-structured 
information.  Two  important  issues  are  (i)  references  to  portions  of  a  text 
(references  and  citations  in  LaTex),  and  (ii)  references  from  the  text  (HTML 
anchors). 

5.  Eclectic  types:  the  same  piece  of  information  may  be  viewed  with  various 
alternative  structures. 

6.  Version and time: it is clear that we are often more concerned with querying
the recent changes in some data source than with examining the entire source.

No  matter  how  rich  a  model  we  choose,  it  is  likely  that  some  weird  features 
of  a  given  application  or  a  particular  data  exchange  format  will  not  be  covered 
(e.g.,  SGML  exceptions).  This  motivates  the  use  of  an  underlying  minimalist 
data  format. 


3.2  A  minimalist  approach 

In  this  section,  we  present  the  Object  Exchange  Model  (OEM)  [AQM+96],  a 
data  model  particularly  useful  for  representing  semi-structured  data. 

The model consists of a graph with labels on the edges. (In an early version
of the model [PGMW95], labels were attached to vertices, which leads to minor
differences in the description of information and in the corresponding query
languages.) A very similar model was independently proposed in [BDHS96]. This
seems to indicate that this model indeed achieves the goals of being simple enough,
and yet flexible and powerful enough, to describe semi-structured data
found in common data sources over the net. A subtle difference is that OEM is
based on the notion of objects with object identity whereas [BDHS96] uses tree
markers and bisimulation. We will ignore this distinction here.

Data  represented  in  OEM  can  be  thought  of  as  a  graph,  with  objects  as  the 
vertices  and  labels  on  the  edges.  Entities  are  represented  by  objects.  Each  object 
has  a  unique  object  identifier  (oid)  from  the  type  oid.  Some  objects  are  atomic 
and  contain  a  value  from  one  of  the  disjoint  basic  atomic  types,  e.g.,  integer, 
real,  string,  gif,  html,  audio,  java,  etc.  All  other  objects  are  complex;  their 
value is a set of object references, denoted as a set of (label, oid) pairs. The
labels are taken from the atomic type string. Figure 1 provides an example of
an  OEM  graph. 

OEM can easily model relational data, and, as in the ODMG model, hierarchical
and graph data. (Although the structure in Figure 1 is almost a tree,
there is a cycle via objects &19 and &35.) To model semi-structured information
sources, we do not insist that data is as strongly structured as in standard
database  models.  Observe  that,  for  example,  (i)  restaurants  have  zero,  one  or 
more  addresses;  (ii)  an  address  is  sometimes  a  string  and  sometimes  a  complex 
structure;  (iii)  a  zipcode  may  be  a  string  or  an  integer;  (iv)  the  zipcode  occurs 
in  the  address  for  some  and  directly  under  restaurant  for  others;  and  (v)  price 
information  is  sometimes  given  and  sometimes  missing. 
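
The irregularities listed above can be made concrete with a minimal sketch of
an OEM-like store. The oids, labels and values below are invented for
illustration (they are not the content of Figure 1), and the helper names
`is_atomic` and `subobjects` are hypothetical. A complex value is kept as a
list of (label, oid) pairs rather than a set, only to keep the output order
predictable.

```python
# Hypothetical OEM database (oids and values invented; this is not Figure 1).
# Each oid maps to an atomic value or to a list of (label, oid) references.
oem = {
    "&1": [("restaurant", "&2"), ("restaurant", "&3"), ("restaurant", "&4")],
    "&2": [("name", "&5"), ("address", "&6"), ("price", "&7")],
    "&3": [("name", "&8"), ("address", "&9"), ("address", "&10"),
           ("zipcode", "&11")],
    "&4": [("name", "&12")],                     # no address, no price
    "&5": "Chez A",
    "&6": [("street", "&13"), ("zipcode", "&14")],  # complex address
    "&7": "cheap",
    "&8": "Chez B",
    "&9": "Mountain View",                       # atomic (string) address
    "&10": "Menlo Park",
    "&11": 94301,                                # integer zipcode
    "&12": "Chez C",
    "&13": "El Camino Real",
    "&14": "92310",                              # string zipcode
}


def is_atomic(oid):
    """An object is atomic when its value is not a list of references."""
    return not isinstance(oem[oid], list)


def subobjects(oid, label):
    """Oids reached from `oid` by an edge carrying `label` (empty if atomic)."""
    if is_atomic(oid):
        return []
    return [target for (edge_label, target) in oem[oid] if edge_label == label]


# Number of addresses per restaurant: one, two, and none.
address_counts = [len(subobjects(r, "address"))
                  for r in subobjects("&1", "restaurant")]
```

Nothing in the store declares which labels a restaurant may carry; a program
discovers the structure only by looking at the edges.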


[Figure 1: an example OEM graph, rooted at Guide.]


We  conclude  this  section  with  two  observations  relating  OEM  to  the  relational 
and  ODMG  models: 

OEM vs. relational: One can view an OEM database as a relational structure
with a binary relation VAL(oid, atomic_value) for specifying the values of
atomic objects and a ternary relation MEMBER(oid, label, oid) to specify the
values of complex objects. This simple viewpoint seems to defeat a large part
of the research on semi-structured data. However, (i) such a representation is
possible only because of the presence of object identifiers, so we are already
out of the relational model; (ii) we have to add integrity constraints to the
relational structure (e.g., to prohibit dangling references); and (iii) it is
often the case that we want to recover an object together with its
subcomponents, and this recursively, a feature that is certainly beyond
relational calculus.
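
As an illustration of points (ii) and (iii), here is a minimal sketch that
stores the two relations as Python sets of tuples. The data, and the function
names `dangling` and `component_oids`, are assumptions made for this example.
The sketch also assumes that every complex object has at least one MEMBER
tuple, so that an oid counts as described when it occurs in VAL or as the
source of an edge.

```python
# Hypothetical instance of the two relations (oids and values are invented).
VAL = {("&2", "Toto"), ("&4", "Paris")}
MEMBER = {
    ("&1", "name", "&2"),
    ("&1", "address", "&3"),
    ("&3", "city", "&4"),
    ("&3", "resident", "&1"),   # cycle back to the person
    ("&1", "friend", "&9"),     # &9 is described nowhere: a dangling reference
}


def dangling(val, member):
    """Targets of MEMBER that are neither atomic nor the source of any edge."""
    described = {oid for (oid, _) in val} | {src for (src, _, _) in member}
    return {dst for (_, _, dst) in member if dst not in described}


def component_oids(root, member):
    """Oids of `root` and of all its subcomponents, recursively (cycle-safe)."""
    seen, todo = set(), [root]
    while todo:
        oid = todo.pop()
        if oid not in seen:
            seen.add(oid)
            todo.extend(dst for (src, _, dst) in member if src == oid)
    return seen
```

The loop in `component_oids` is a transitive closure over MEMBER, which is the
kind of recursion that relational calculus cannot express.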

OEM vs. ODMG: In the object exchange model, all objects have the same
type, namely OEM. Intuitively, this type is a tuple with one field per possible
label containing a set of OEMs. Based on this, it is rather straightforward
to have a type system that would incorporate the ODMG types and the
OEM type (see [AQM+96]). This is a first step towards a model that would
integrate the minimalist and maximalist approaches.


4  Querying  and  Restructuring 

In  the  context  of  semi-structured  data,  the  query  language  has  to  be  more  flexible 
than  in  conventional  database  systems.  Typing  should  be  more  liberal  since  by 
nature  data  is  less  regular.  What  should  we  expect  from  a  query  language? 

1.  standard  database-style  query  primitives; 

2.  navigation  in  the  style  of  hypertext  or  Web-style  browsing; 

3.  searching for patterns in an information-retrieval style [Rie79];

4.  temporal  queries,  including  querying  versions  or  querying  changes  (an  issue 
that  we  will  ignore  further  on); 

5.  querying  both  the  data  and  the  type/schema  in  the  same  query  as  in  [KL89]. 

Also, the language should have sound theoretical foundations, possibly a logic
in the style of relational calculus. So, there is a need for more work on
calculi for semi-structured data and algebraizations of these calculi.

All this requires not only revisiting the languages but also database
optimization techniques, and in particular, integrating these techniques with
optimization techniques from information retrieval (e.g., full text indexing)
and new techniques for dealing with path expressions and more general hypertext
features.

There has been a very important body of literature on query languages from
various perspectives (calculus, algebra, functional, and deductive; see [Ull89,
AHV94]) concerning very structured data. A number of more recent proposals
directly concern semi-structured data, most notably Lorel [AQM+96] for the OEM
model and UnQL [BDHS96] for a very similar model. Although developed with
different motivations, languages to query documents satisfy some of the needs
of querying semi-structured data. For instance, query languages for structured
documents such as OQL-doc [CACS94] and integration with information retrieval
tools [ACC+96, CM95] share many goals with the issues that we are considering.
The work on query languages for hypertext structures, e.g., [MW95, BK90, CM89b,
MW93], and on query languages for the Web is also relevant. In particular,
query languages for the Web have attracted a lot of attention recently, e.g.,
W3QL [KS95], which focuses on extensibility, WebSQL [MMM96], which provides a
formal semantics and introduces a notion of locality, or WebLog [LSS96], which
is based on a Datalog-like syntax. A theory of queries of the Web is proposed
in [AV97].

W3QL is typical of this line of work. It notably allows the use of Perl
regular expressions and calls to Unix programs from the where clause of an
SQL-like query, and even calls to Web browsers. This is the basis of a system
that provides bridges between database and Web technology.

We do not provide here an extensive survey of that literature. We more
modestly focus on some concepts that we believe are essential to query
semi-structured data; these are considered next. Finally, we mention the issue
of data restructuring.

4.1  Primitives  for  querying  semi-structured  data 

In  this  section,  we  mention  some  recent  proposals  for  querying  semi-structured 
data. 


Using an object approach: The notion of objects and the flexibility brought
by an object approach turn out to be essential. Objects allow one to focus on
the portion of the structure that is relevant to the query and to disregard
the portions that we ignore (or want to ignore).

To see that, consider first the relational representation of OEM that was
described in Section 3.2 and relational query languages. We can express simple
queries such as "what is the address of Toto?" even when we ignore the exact
structure of person objects, or even if not all persons have the same structure:

select unique V'.2
from persons P, MEMBER N, MEMBER A, VAL V, VAL V'
where P = N.1 and P = A.1 and
      N.2 = "name" and N.3 = V.1 and V.2 = "Toto" and
      A.2 = "address" and A.3 = V'.1

assuming a unary relation persons contains the oids of all persons. Observe that
this assumes only that persons have names and addresses.
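
The same join can be tried out with Python's standard `sqlite3` module. This is
only a sketch: the column names (`oid`, `atom`, `src`, `label`, `dst`), the
alias `V2` standing for V', and the sample data are all assumptions introduced
here, and SQL's `select distinct` stands in for `select unique`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table persons(oid text);
    create table VAL(oid text, atom text);
    create table MEMBER(src text, label text, dst text);
""")
# Hypothetical data: &1 has two addresses, &2 has an extra field, &3 has
# no address at all.
con.executemany("insert into persons values (?)", [("&1",), ("&2",), ("&3",)])
con.executemany("insert into MEMBER values (?, ?, ?)", [
    ("&1", "name", "&10"), ("&1", "address", "&11"), ("&1", "address", "&12"),
    ("&2", "name", "&20"), ("&2", "address", "&21"), ("&2", "age", "&22"),
    ("&3", "name", "&30"),
])
con.executemany("insert into VAL values (?, ?)", [
    ("&10", "Toto"), ("&11", "Paris"), ("&12", "Rome"),
    ("&20", "Lulu"), ("&21", "Oslo"), ("&22", "41"),
    ("&30", "Toto"),
])
# One MEMBER alias per property consulted, one VAL alias per atomic value.
rows = con.execute("""
    select distinct V2.atom
    from persons P, MEMBER N, MEMBER A, VAL V, VAL V2
    where P.oid = N.src and P.oid = A.src
      and N.label = 'name' and N.dst = V.oid and V.atom = 'Toto'
      and A.label = 'address' and A.dst = V2.oid
""").fetchall()
toto_addresses = sorted(row[0] for row in rows)
```

The person &3, also named Toto, contributes nothing because it has no address
edge, and the extra age field of &2 is simply never consulted.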

In this manner, we can query semi-structured data with almost no knowledge
of the underlying structure using the standard relational model. However, the
expression of the query is rather awkward. Furthermore, this representation of
the data results in losing the "logical clustering" of data. The description of
an object (a tuple or a collection) is split into pieces, one triplet for each
component. A more natural way to express the same query is:

Q1  select A from persons P, P.address A
    where "Toto" = P.name

This is actually correct OQL syntax; but OQL would require persons to be
a homogeneous set of objects, fitting the ODMG model. On the other hand,
Lorel (based on OEM) imposes no restriction on the types of objects in
the persons set, and Q1 is also a correct Lorel query. In OEM, person objects
are allowed to have zero, one or more names and addresses. Of course, the Lorel
query Q1 will retrieve only persons with a name and an address. Lorel achieves
this by an extensive use of coercion.


Using coercion: A simple example of coercion is found with atomic values.
Some sources may record a distance in kilometers and some in miles. The system
can still perform comparisons using coercion from one measure to the other. For
instance, a comparison X < Y where X is in kilometers and Y in miles is coerced
into X < mile_to_km(Y).

The same idea of coercion can be used for structure as well. Since we can
neither assume regularity nor precise knowledge of the structure, the name or
address of a person may be atomic in some source, a set in other sources, and
not be recorded by a third. Lorel allows one to use Q1 even in such cases. This
is done by first assuming that all properties are set-valued. The empty set
(denoting the absence of this property) and the singleton set (denoting a
functional property) are simply special cases. The query Q1 is then transformed
by coercing the equality P.name = "Toto" into the set membership
"Toto" in P.name.
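
A minimal sketch of this structural coercion follows. It models objects as
Python dictionaries, which is a simplification assumed for this example and
not Lorel's implementation; the names `values` and `q1` and the sample data
are hypothetical.

```python
def values(obj, label):
    """All values of property `label`, viewed as a list (empty if absent)."""
    v = obj.get(label)
    if v is None:
        return []                      # absent property: the empty set
    if isinstance(v, (list, set, tuple)):
        return list(v)                 # already set-valued
    return [v]                         # atomic value: a singleton set


def q1(persons, name):
    # The equality `P.name = name` is coerced into `name in P.name`.
    return [a for p in persons if name in values(p, "name")
            for a in values(p, "address")]


persons = [
    {"name": "Toto", "address": "Paris"},                        # atomic
    {"name": ["Toto", "T. Otto"], "address": ["Rome", "Oslo"]},  # set-valued
    {"name": "Toto"},                                            # no address
    {"address": "Lima"},                                         # no name
]
toto_addresses = q1(persons, "Toto")
```

Persons lacking a name or an address drop out without any special case in the
query, because iterating over an empty list yields nothing.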

So,  the  principle  is  to  use  a  data  model  where  all  objects  have  the  same 
interface  and  allow  a  lot  of  flexibility  in  queries.  Indeed,  in  Lorel,  all  objects 
have  the  same  type,  OEM. 


Path expressions and Patterns: The simplest use of path expressions is to
concatenate attribute names as in "Guide.restaurant.address.zipcode". If Guide
is a tuple, with a restaurant field that has an address field, that has a
zipcode field, this is pure field extraction. But if some properties are
set-valued (or all are set-valued as for OEM), we are in fact doing much more.
We are traversing collections and flattening them. This provides a powerful
form of navigation in the database graph. Note that such a path expression can
now be interpreted in two ways: (i) as the set of objects at the end of the
paths; and (ii) as the paths themselves. Languages such as OQL-doc [CACS94]
consider paths as first-class citizens and even allow the use of path variables
that range over concrete paths in the data graph.
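
The two interpretations can be sketched over a small graph. The data and the
function names `end_objects` and `concrete_paths` are assumptions made for this
example; an object is either an atomic value or a list of (label, oid) pairs.

```python
# Hypothetical data graph: oid -> atomic value or list of (label, oid) pairs.
graph = {
    "guide": [("restaurant", "r1"), ("restaurant", "r2")],
    "r1": [("address", "a1")],
    "r2": [("address", "a2"), ("address", "a3")],
    "a1": [("zipcode", "z1")],
    "a2": [("zipcode", "z2")],
    "a3": "Menlo Park",            # atomic address: nothing below it
    "z1": "94301",
    "z2": "94025",
}


def concrete_paths(graph, start, labels):
    """Interpretation (ii): every path from `start` that spells `labels`."""
    paths = [[start]]
    for label in labels:
        extended = []
        for path in paths:
            value = graph[path[-1]]
            if isinstance(value, list):          # atomic objects have no edges
                for (edge_label, target) in value:
                    if edge_label == label:
                        extended.append(path + [target])
        paths = extended
    return paths


def end_objects(graph, start, labels):
    """Interpretation (i): the objects at the end of those paths."""
    return [path[-1] for path in concrete_paths(graph, start, labels)]


zip_path = ["restaurant", "address", "zipcode"]
zip_oids = end_objects(graph, "guide", zip_path)
zip_paths = concrete_paths(graph, "guide", zip_path)
```

Each step replaces the current collection of paths by all of its one-edge
extensions, which is the traversal and flattening described above.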

Such simple path expressions can be viewed as a form of browsing.
Alternatively, they can be viewed as specifying certain line patterns that have
to be found in the data graph. One could also consider non-line patterns such
as person { name , ss# }, possibly with variables in the style of the psi-terms
[AKP93].

Extended path expressions: The notion of path expression takes its full
power when we start using it in conjunction with wild cards or path variables.
Intuitively, a sequence of labels describes a directed path in the data graph,
or a collection of paths (because of set-valued properties). If we consider a
regular expression over the alphabet of labels, it describes a (possibly
infinite) set of words, so again a set of paths, i.e., the union of the paths
described by each word. Indeed, this provides an alternative (and much more
powerful) way of describing paths.

Furthermore, recall that labels are strings, so they are themselves sequences
of characters. So we can also use regular expressions to describe labels. This
poses some minor syntactic problems since we need to distinguish between the
regular expressions for the sequence of labels and for the sequence of
characters of each label. The approach taken in Lorel is based on "wild cards".
We briefly discuss it next.

To take again an example from Lorel, suppose we want to find the names and
zipcodes of all "cheap" restaurants. Suppose we don't know whether the zipcode
occurs as part of an address or directly as a subobject of restaurants. Also, we
do not know if the string "cheap" will be part of a category, price,
description, or other subobject. We are still able to ask the query as follows:

select R.name, R(.address)?.zipcode
from Guide.restaurant R
where R.% grep "cheap"

The "?" after .address means that the address is optional in the path
expression. The wild-card "%" will match any label leading to a subobject of
restaurant. The comparison operator grep will return true if the string "cheap"
appears anywhere in that subobject value. There is no equivalent query in SQL
or OQL, since neither allows regular expressions or wild-cards.
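
To make the three constructs concrete, here is a hand-written Python evaluation
of this one query over a small invented graph. It is a sketch, not Lorel's
evaluator: the helper names are hypothetical, and `contains` assumes that grep
inspects only the atomic string value of the subobject reached by "%".

```python
# Hypothetical data graph: oid -> atomic value or list of (label, oid) pairs.
g = {
    "guide": [("restaurant", "r1"), ("restaurant", "r2"), ("restaurant", "r3")],
    "r1": [("name", "n1"), ("address", "a1"), ("price", "p1")],
    "r2": [("name", "n2"), ("zipcode", "z2"), ("description", "d2")],
    "r3": [("name", "n3"), ("address", "a3"), ("price", "p3")],
    "a1": [("zipcode", "z1")],
    "a3": "Palo Alto",
    "n1": "Chez A", "n2": "Chez B", "n3": "Chez C",
    "z1": "94301", "z2": 94025,
    "p1": "cheap", "d2": "cheap and cheerful", "p3": "expensive",
}


def kids(g, oid, label=None):
    """Subobjects of `oid`; with label=None this plays the role of '%'."""
    value = g[oid]
    if not isinstance(value, list):
        return []
    return [t for (edge_label, t) in value if label is None or edge_label == label]


def contains(g, oid, word):
    """Assumed grep semantics: `word` occurs in an atomic string value."""
    return isinstance(g[oid], str) and word in g[oid]


def cheap_restaurants(g, guide):
    result = []
    for r in kids(g, guide, "restaurant"):                 # from clause
        if not any(contains(g, s, "cheap") for s in kids(g, r)):
            continue                                       # where R.% grep
        names = [g[n] for n in kids(g, r, "name")]
        # R(.address)?.zipcode: look under R itself and under its addresses.
        holders = [r] + kids(g, r, "address")
        zipcodes = [g[z] for h in holders for z in kids(g, h, "zipcode")]
        result.append((names, zipcodes))
    return result


answer = cheap_restaurants(g, "guide")
```

The optional step is handled by starting the zipcode lookup from both the
restaurant and its addresses, so the query author need not know where the
zipcode is stored.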

This last example again seems amenable to a relational calculus translation,
although the use of a number of % wildcards may lead to a very intricate
relational calculus equivalent, and so would the introduction of disjunction.
Note that the Kleene closure in label sequences built into path expressions in
[AQM+96] and OQL-doc [CACS94] takes us immediately out of first order. For
instance, consider the following Lorel query:


select t from MyReport.#.title t


where "#" is a shorthand for a sequence of arbitrarily many labels. This
returns the title of my report, but also the titles of the sections,
subsections, etc., no matter how deeply nested.
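
A sketch of how such a query can be evaluated: "#" amounts to reachability in
the data graph, computed here with a visited set so that cycles do not cause
non-termination. The graph, the extra "parent" edge that creates a cycle, and
the function names are all assumptions made for this example.

```python
# Hypothetical report graph: oid -> atomic value or list of (label, oid) pairs.
report = {
    "MyReport": [("title", "t0"), ("section", "s1"), ("section", "s2")],
    "s1": [("title", "t1"), ("subsection", "s11")],
    "s11": [("title", "t11"), ("parent", "s1")],   # cycle s1 -> s11 -> s1
    "s2": [("title", "t2")],
    "t0": "Semi-structured data",
    "t1": "Model",
    "t11": "OEM",
    "t2": "Queries",
}


def edges(g, oid):
    """Outgoing (label, oid) pairs; atomic objects have none."""
    value = g[oid]
    return value if isinstance(value, list) else []


def reach_any_labels(g, start):
    """Objects reached from `start` by any label sequence, the empty one included."""
    seen, todo = set(), [start]
    while todo:
        oid = todo.pop()
        if oid not in seen:
            seen.add(oid)
            todo.extend(target for (_, target) in edges(g, oid))
    return seen


# MyReport.#.title: follow a title edge from every reachable object.
titles = sorted({report[t]
                 for oid in reach_any_labels(report, "MyReport")
                 for (label, t) in edges(report, oid) if label == "title"})
```

The number of join steps needed is not bounded by the query but by the data,
which is why this closure falls outside first order.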

The notion of path expression is found first in [MBW80] and more recently,
for instance, in [KKS92, CACS94, AQM+96]. Extended path expressions are a very
powerful primitive construct that changes the languages in essential ways. The
study of path expressions and their expressive power (e.g., compared to
Datalog-like languages) is one of the main theoretical issues in the context of
semi-structured data. The optimization of the evaluation of extended path
expressions initiated in [CCM96] is also a challenging problem.

Gluing  information  and  rest  variables:  As  mentioned  above,  a  difficulty  for 
languages  for  semi-structured  data  is  that  collections  are  heterogeneous  and  that 
often  the  structure  of  their  components  is  unknown.  Returning  to  the  persons 
example,  we  might  want  to  say  that  we  are  concerned  only  with  persons  having 
a  name,  an  address,  and  possibly  other  fields.  MSL  [PGMW95]  uses  the  notion 
of  rest  variables  to  mention  “possibly  other  fields”  as  for  instance  in: 

res(name:X, address:Y; REST1) :- r(name:X, address:Y; REST1),
                                 Y = (city:"Palo Alto"; REST2)

Here r is a collection of heterogeneous tuples. The first literal in the body of
the rule will unify with any tuple with a name and address. The REST1 variable
will unify with the remaining part of the tuple. Observe that this allows
filtering the tuples in r without having to specify precisely their internal
structure.
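
The effect of this rule can be imitated in a few lines of Python. This is a
sketch under a simplifying assumption: tuples are dictionaries, so each label
occurs at most once, which MSL itself does not require. The names `match` and
`res` and the sample collection are hypothetical.

```python
def match(tup, labels):
    """Unify `tup` with (l1:X1, ..., ln:Xn; REST).

    Returns (bindings, rest), or None when `tup` is not a tuple or lacks
    one of the required labels.
    """
    if not isinstance(tup, dict) or any(l not in tup for l in labels):
        return None
    bindings = {l: tup[l] for l in labels}
    rest = {l: v for (l, v) in tup.items() if l not in labels}
    return bindings, rest


def res(r):
    out = []
    for tup in r:
        first = match(tup, ["name", "address"])          # binds X, Y, REST1
        if first is None:
            continue
        bound, rest1 = first
        second = match(bound["address"], ["city"])       # Y = (city:...; REST2)
        if second is None or second[0]["city"] != "Palo Alto":
            continue
        out.append({"name": bound["name"], "address": bound["address"], **rest1})
    return out


r = [
    {"name": "Ann", "address": {"city": "Palo Alto", "street": "Main St"},
     "phone": "555-0100"},
    {"name": "Bob", "address": "Palo Alto"},       # atomic address: no match
    {"name": "Cid", "address": {"city": "Paris"}},
    {"address": {"city": "Palo Alto"}},            # no name: no match
    {"name": "Dee", "address": {"city": "Palo Alto"}},
]
selected = res(r)
```

Ann's phone field is carried into the result through REST1 although the rule
never names it.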

This approach is in the spirit of some works in the functional programming
community to allow dealing with heterogeneous records, e.g., [Wan89, CM89a,
Rem91]. One of the main features is the use of extensible records that are the
basis of inheritance for objects as records. However, the situation turns out
to be much simpler in MSL since: (i) there is much less emphasis on typing; and
(ii) in particular, it is not assumed that a tuple has at most one l-component
for a given label l.

Object identity is also used in MSL [PAGM96] to glue information coming
from various, possibly heterogeneous, objects. For instance, the following two
rules allow one to merge the data from two sources using name as a surrogate:

&person(X)(name:X, ATT:Y) :- r1(name:X, ATT:Y)

&person(X)(name:X, ATT:Y) :- r2(name:X, ATT:Y)

Here &person(X) is an object identifier and ATT is a variable. Intuitively, for
each tuple in r1 (or r2) with a name field X, and some ATT field Y, the object
&person(X) will have an ATT field with value Y. Observe the use of object
identity as a substitute for specifying too precisely the structure. Because of
object identity, we do not need to use a notion such as a REST variable to
capture in one rule instantiation all the necessary information.
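
A minimal sketch of this gluing, assuming that source tuples are dictionaries
with hashable values and that the Skolem term &person(X) can be represented by
the Python pair ("&person", X). The function name `glue` and the two sources
are invented for illustration.

```python
def glue(*sources):
    """Apply the two rules: one object per name, fields merged across sources."""
    objects = {}
    for source in sources:
        for tup in source:
            if "name" not in tup:
                continue                         # the rule body does not unify
            oid = ("&person", tup["name"])       # Skolem term &person(X)
            fields = objects.setdefault(oid, set())
            for (att, y) in tup.items():         # ATT ranges over all labels
                fields.add((att, y))
    return objects


r1 = [{"name": "Toto", "phone": "555-0100"}]
r2 = [{"name": "Toto", "email": "toto@example.org"},
      {"name": "Lulu", "age": 30},
      {"nickname": "anonymous"}]
merged = glue(r1, r2)
```

Each tuple is processed on its own; the fields meet only because both sources
compute the same identifier for Toto.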

We should observe again that these can be viewed as Datalog extensions that
were introduced for practical motivations. Theoretical results in this area are
still missing.


4.2  Views  and  restructuring 

Database languages are traditionally used for extracting data from a database.
They also serve to specify views. The notion of view is particularly important
here since we often want to consider the same object from various perspectives
or with various precisions in its structure (e.g., for the integration of
heterogeneous data). We need to specify complex restructuring operations. The
view technology developed for object databases can be considered here, e.g.,
[dSAD94]. But we have much less structure at our disposal to start with when
defining the view and, again, arbitrarily deep nesting and cycles pose new
challenges.


Declarative specification of a view: Following [dSAD94], a view can be
defined by specifying the following: (i) how the object population is modified
by hiding some objects and creating virtual objects; and (ii) how the
relationship between objects is modified by hiding and adding edges between
objects, or modifying edge labels.

A  simple  approach  consists  of  adding  hide/create  vertices/edges  primitives 
to  the  language  and  using  the  core  query  language  to  specify  the  vertices/edges 
to  hide  and  create.  This  would  yield  a  syntax  in  the  style  of: 

define view Salary with
    hide select P.salary from persons P
         where P.salary > 100K
    virtual add P.salary := "high" from persons P
         where P.salary > 100K

For  vertex  creation  one  could  use  a  Skolem-based  object  naming  [KKS92]. 
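
The intended effect of the Salary view can be sketched as follows. The sketch
assumes that persons are dictionaries with at most one numeric salary and that
100K stands for 100,000; the function name `salary_view` is hypothetical, and
the view is materialized as a copy so that the base data stay untouched.

```python
def salary_view(persons, threshold=100_000):
    """Hide salaries above `threshold` and add a virtual salary "high"."""
    view = []
    for person in persons:
        seen_through_view = dict(person)          # never modify the base data
        salary = seen_through_view.get("salary")
        if isinstance(salary, (int, float)) and salary > threshold:
            seen_through_view["salary"] = "high"  # hide + virtual add
        view.append(seen_through_view)
    return view


persons = [
    {"name": "Ann", "salary": 150_000},
    {"name": "Bob", "salary": 80_000},
    {"name": "Cid"},                              # no salary recorded
]
view = salary_view(persons)
```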

The  declarative  specification  of  data  restructuring  for  semi-structured  data 
is  also  studied  in  [ACM97]. 


A more procedural approach: A different approach is followed in [BDHS96]
in the languages UnQL and UnCAL. A first layer of UnQL allows one to ask
queries and is in the style of other proposals such as OQL-doc or Lorel, e.g.,
it uses wild cards. The language is based on a comprehension syntax. Parts of
UnQL are of a declarative flavor. On the other hand, we view the restructuring
part as more procedural in essence. This opinion is clearly debatable.

A particular aspect of the language is that it allows some form of
restructuring even for cyclic structures. A traverse construct allows one to
transform a database graph while traversing it, e.g., by replacing all labels A
by the label A'. This powerful operation combines tree rewriting techniques
with some control obtained by a guided traversal of the graph. For instance,
one could specify that the replacement occurs only if a particular edge, say B,
is encountered on the way from the root.
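
The guided relabeling can be illustrated as follows. This is not UnQL's
traverse construct or its semantics, only a hypothetical Python sketch of the
example just given. Output vertices are pairs (oid, seen_B), so a vertex
reachable both with and without a B edge on the way is duplicated, and the
traversal terminates on cycles because there are finitely many such pairs.

```python
def relabel(edges, root, old="A", new="A'", guard="B"):
    """Rename `old` edges to `new`, but only below an edge labeled `guard`.

    `edges` is a collection of (source, label, target) triples; the result
    is a set of triples over (oid, seen_guard) states.
    """
    out = set()
    seen = {(root, False)}
    todo = [(root, False)]
    while todo:
        oid, seen_guard = todo.pop()
        for (src, label, dst) in edges:
            if src != oid:
                continue
            new_label = new if (label == old and seen_guard) else label
            state = (dst, seen_guard or label == guard)
            out.add(((oid, seen_guard), new_label, state))
            if state not in seen:
                seen.add(state)
                todo.append(state)
    return out


# Hypothetical cyclic graph: r -A-> x, r -B-> y, y -A-> x, x -C-> r.
edges = [("r", "A", "x"), ("r", "B", "y"), ("y", "A", "x"), ("x", "C", "r")]
result = relabel(edges, "r")
```

The A edge leaving the root is kept as it is, while the A edge leaving y, which
lies below a B edge, is renamed.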

A lambda calculus for semi-structured data, called UnCAL, is also presented
in [BDHS96] and the equivalence with UnQL is proven. This yields a framework
for an (optimized) evaluation of UnQL queries. In particular, it is important
to be able to restructure a graph by local transformations (e.g., if the graph
is distributed, as is the case on the Web). The locality of some restructuring
operations is exploited in [Suc96].